Abstract

Automatic hypertext generation remains an extremely chal- 
lenging endeavor in the digital library world. In this paper 
we present a solution for automatically connecting relevant 
information in dynamic textual digital libraries. This tex- 
tual information is generally unconnected and often unex- 
plored due to the large flow of information entering from 
remote and local sources. Often, full-text indexes exist for 
this information but embedded links to related information 
are conspicuously absent. Links that do exist are usually 
generated in an arduous and time-consuming manual pro- 
cess. That is why the ability to automatically generate links 
has a potentially high payoff. 

Our solution for the automatic generation of hypertext
links relies on the techniques of document segmentation and
document clustering. Hypertext links are automatically
generated during the document clustering process using the
incremental cover-coefficient-based clustering algorithm. The
issues of link completeness and link quality are also
addressed in this paper. Link completeness is studied by
comparing the cluster-based approach of link generation to the
exhaustive link generation approach. Results indicate that
links are more complete in the higher similarity range than
in the lower similarity range. Initial link quality user
studies indicate that the cluster-based hypertext link
generation approach is promising. In the future, we plan to
conduct further studies on link quality and investigate ways
to increase the effectiveness of our approach.

1 Introduction 

As the countries of the world compete in an ever expanding
global market, information will become the single most
important resource for economic growth, national security, and
education. Much of this valuable information will be
available in the digital libraries of the world due to the
continual technological advances in the computer industry and
the explosive growth of the Internet. Recent studies on Internet
growth have revealed that the World-Wide Web has tripled
in size over the seven month period from December
to June 1995 [Bou95]. As this ocean of data continues to
expand, and as advances in electronic communication
technologies make hypermedia accessibility a reality, new
techniques will be needed to seek out and maintain pointers to
relevant information. Without the ability to automatically
connect relevant information, the strategic use of these
libraries will not reach full potential.

The advantages of hypertext as an information retrieval 
tool are well known. Adding a linking layer on top of
existing indexes can improve a user's chances of finding the
"right" information. Combining information retrieval
techniques with hypertext empowers a user to issue directed
content searches or simply browse collections while looking 
for relevant information. For example, documents returned 
from a query may contain links to other documents of inter- 
est that were missed by the original query. 

Automatically generating links between related text is 
an extremely challenging endeavor. In fact, little progress 
has been made since Vannevar Bush introduced hypertext 
to the world. Many projects have attempted to automatically
generate links using varying approaches [All95] [Far89]
[Tho91] but have met with limited success. Often their
approach suffers from tradeoffs such as creation of a system
that maximizes effectiveness while placing little emphasis on 
efficiency, or creation of a system that only works for static 
libraries, disregarding dynamic digital libraries. The ability 
to automatically and dynamically generate links between re- 
lated documents based on a global view of the collection is 
our goal. 

Another significant objective of this project is integra- 
tion of automatic hypertext linking with current academic 
and commercial systems including Virginia Tech's Envision 
digital library and PRC's Productivity Edge™ document
management product. Envision allows full-text searching 
and full-content retrieval [Hea95] on a collection of computer 
science literature. It features a unique visualization method 
for displaying the results of a query. In the Envision digital 
library, the automatic linking capability will assist students
and teachers with making the most efficient and effective use 
of the library. Productivity Edge is a flexible data manage- 
ment solution designed to effectively manage business doc- 
uments and engineering drawings - from creation, through 
revision, to distribution and storage. In Productivity Edge, 
the linking capability will allow users to quickly and easily 
locate and access dynamic on-line business information. 

In this paper, we will discuss our solution to the auto- 
matic hypertext link generation problem. We use the tech- 
niques of document clustering and document segmentation 
to solve this problem. The paper is organized in the
following manner:

1. A brief overview of the current research in the field of
automatic hypertext link generation.

2. Our solution to the link generation problem.

3. A discussion of the experiments to test the validity of
our approach and the analysis of the results.

4. Future directions for our research.

2 Background

In 1965, Nelson coined the word hypertext and defined it as 
"a body of written or pictorial material interconnected in a 
complex way that it could not be conveniently represented 
on a paper. It may contain summaries or maps of its con- 
tents and their interrelations; it may contain annotations, 
additions and footnotes from scholars who have examined 
it" [Nel65].

A hypertext document, in terms of our project, is made
up of nodes and links. Nodes are the actual content of
the document and may contain text, figures, tables,
audio, video, and other forms of data. Links connect related
nodes, where each node is the source or destination of a link.
Links can be bi-directional or unidirectional, and links can
be typed. Typed links [Tri87] make the relationship between
the nodes explicit.

The need for automatic link generation was realized as
early as 1945 by Vannevar Bush, even before the term
hypertext was coined. In his seminal article "As We May Think",
Bush describes a device called the Memex that has the
capability to connect two related documents: "It affords an
immediate step, however, to associative indexing, the
basic idea of which is a provision whereby any item may be
caused at will to select immediately and automatically
another" [Bus45].

Research in automatic hypertext generation uses
techniques from pattern matching, information retrieval,
natural language processing, and neural networks. Past research
dealt with the conversion of marked-up document
collections into hypertext. For documents that are not
marked up, pattern matching schemes were used to generate links.
Current research deals with the conversion of plain text to
hypertext using advanced techniques from information
retrieval, natural language processing, and neural networks.
A brief overview of the various link generation
methodologies is given below.

2.1 Link detection based on pattern matching 

In this approach, links are generated using keyword
matching. Synonymy and polysemy limit the usefulness of this
approach [Tho91]. Synonymy, multiple words with the same
meaning, is a cause for links being undetected, while
polysemy, a single word with multiple meanings, is a cause for
the generation of poor links. In addition, keyword-based
approaches generate redundant links. For example, consider
building links from a document to a dictionary. If a key- 
word occurs several times within a section, only the first 
instance of the keyword should be linked to its definition in 
the dictionary. Otherwise the text will be dominated by the 
hypertext links and readability of the document will suffer. 
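
The dictionary example above can be sketched in a few lines of Python. This is only an illustration of the "link the first instance only" rule, not part of the system described in this paper; the function name, the bracketed link notation, and the sample dictionary are hypothetical.

```python
import re

def link_first_occurrences(section, dictionary):
    """Mark only the first occurrence of each keyword in a section.

    `dictionary` maps lowercase keywords to link targets (hypothetical format).
    """
    if not dictionary:
        return section
    # Longest keywords first so that multi-word entries win over their prefixes.
    keys = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
                         re.IGNORECASE)
    seen = set()

    def replace(match):
        key = match.group(0).lower()
        if key in seen:
            return match.group(0)  # later repeats stay plain text
        seen.add(key)
        return f"[{match.group(0)} -> {dictionary[key]}]"

    return pattern.sub(replace, section)

text = "A node holds content. Each node may be the source of a link."
linked = link_first_occurrences(text, {"node": "dict#node"})
print(linked)
```

Note that such a sketch still inherits the synonymy and polysemy problems described above, since it matches surface strings only.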



2.2 Link detection based on document mark-up

In this approach, the mark-ups in the document are used to
generate links [Tho91]. Most links generated by this method
are referential in nature. Links can be generated from the
references of the object to the actual object. Structural
and hierarchical links can also be generated using this
approach. For example, a table of contents for a document
can be easily generated from the mark-ups of the section 
and the sub-section headings of the document. A problem 
with this approach is that not all documents available in 
digital libraries are marked up. 

2.3 Link detection based on information retrieval and
visualization

James Allan [All95] describes techniques for link detection
and typing using information retrieval (IR) and
visualization. Link types are detected by the steps given below.

• Decompose documents into smaller sub-parts - e.g., 
sections, paragraphs. 

• Determine similarity between each sub-part of the first 
document and each sub-part of the second document. 
Remember all pairings that have similarity values above 
a certain threshold. 

• Generate links between document pairings that have 
similarity values above a certain threshold. 

• Identify patterns with these links, and use those pat- 
terns to describe the type of link. 
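
The first three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [All95]: it assumes paragraphs as sub-parts, raw term counts as vectors, and cosine similarity as the measure, and it omits the final link-typing step. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (an assumed, very simple representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def pair_subparts(doc_a, doc_b, threshold):
    """Return (i, j, similarity) for paragraph pairs above the threshold."""
    parts_a = [p for p in doc_a.split("\n\n") if p.strip()]
    parts_b = [p for p in doc_b.split("\n\n") if p.strip()]
    links = []
    for i, pa in enumerate(parts_a):
        for j, pb in enumerate(parts_b):
            sim = cosine(vectorize(pa), vectorize(pb))
            if sim > threshold:
                links.append((i, j, round(sim, 3)))
    return links

doc_a = "hypertext links connect nodes\n\nclustering groups documents"
doc_b = "clustering groups similar documents\n\nweather report"
links = pair_subparts(doc_a, doc_b, threshold=0.25)
print(links)
```

Every sub-part of one document is compared with every sub-part of the other, which is the source of the efficiency concern for large collections.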

Though this approach seems promising, it cannot be used 
interactively due to efficiency considerations [All95], and is
therefore not suited for digital libraries. The same is true
for other advanced methods such as natural language
processing and neural network-based methods. In some cases
existing links must be regenerated, requiring the entire link
generation process to be repeated, when a document is
incrementally added, modified, or deleted. In the next section
of this paper we propose a solution for the automation of 
hypertext generation with the above mentioned problems in 
mind. 

3 Approach 

Several problems exist with current hypertext link
generation methods: (1) manual construction of hypertext links
between related documents in a digital library is expensive
[Fox91] [Tho91] [Sal94]; (2) links are often generated by a
small number of authors or other users, without knowledge
of the breadth of relevant information in the collection; (3)
links are generally created once and often precede the
addition of new data to the collection; (4) following these links
presupposes that what is relevant information for one
person is relevant for everyone; (5) the amount of information
that will flow into digital libraries is overwhelming to users.
Therefore, it is imperative that engines are developed to
automatically generate links, based on a system view of the
information.

Our system generates hypertext links in an innovative
way by using clustering as the basis for link creation. We use
the cover-coefficient-based incremental clustering
methodology (C²ICM) [Can93] to generate links between the
document (document sub-parts) pairs of each cluster. We use
C²ICM because it can handle large collections [Can95], it
is dynamic in nature, and it produces statistically valid
clusters compared with those of reclustering algorithms.

The test collection consists of an ASCII text version 
of the ACM Hypertext Compendium [ACM91] and other 
heterogeneous documents stored in the PRC Productivity 
Edge document management system, which includes the 
Virginia Tech research software. Inside a given collection, 
such as the ACM Hypertext Compendium, we assume a struc- 
ture of three levels: document, paragraph, and sentence. 
Users of our system can control the indexing to produce 
vectors at any or all of these levels. We have built our own 
indexing and subsequent vector processing routines around 
the Fulcrum Ful/Text™ system. 
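
The three-level structure can be illustrated with a small segmentation sketch. It is not the parser used in our system: the blank-line paragraph rule, the punctuation-based sentence rule, and the identifier scheme (for example `D1.P1.S2`) are assumptions made only for this example.

```python
import re

def segment(doc_id, text):
    """Split a document into (node_id, level, text) tuples at three levels."""
    nodes = [(doc_id, "document", text)]
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    for p_no, paragraph in enumerate(paragraphs, start=1):
        p_id = f"{doc_id}.P{p_no}"
        nodes.append((p_id, "paragraph", paragraph))
        # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        for s_no, sentence in enumerate(sentences, start=1):
            nodes.append((f"{p_id}.S{s_no}", "sentence", sentence))
    return nodes

nodes = segment("D1", "First sentence. Second one!\n\nNew paragraph here.")
for node_id, level, content in nodes:
    print(node_id, level, repr(content))
```

Each tuple would then be indexed as its own vector, so that links can connect nodes at any of the three levels.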

C²ICM is used to cluster the documents together. The
automatic link generation phase is embedded within the
clustering phase. Document pairs with a similarity above
a given threshold and within each cluster are identified as
candidate links. One important attribute of links is the
similarity value between the source and destination of the links.
Henceforth this attribute will be referred to as link
similarity. The Productivity Edge database stores the links with
other document attributes.
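
The candidate-link rule (same cluster, similarity above a threshold) can be sketched as below. The cluster contents and similarity values are invented for illustration, and the similarity function is passed in as a parameter because the weighting scheme is not restated here.

```python
from itertools import combinations

def links_within_clusters(clusters, similarity, threshold):
    """Return (a, b, link_similarity) for pairs inside the same cluster."""
    links = []
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            sim = similarity(a, b)
            if sim > threshold:
                links.append((a, b, sim))
    return links

# Hypothetical clusters (seed -> all members) and pairwise similarities.
clusters = {"S1": ["S1", "S3", "P2"], "P3": ["P3", "S5"]}
sims = {
    frozenset(("S1", "S3")): 0.7,
    frozenset(("S3", "P2")): 0.4,
    frozenset(("S1", "P2")): 0.1,
    frozenset(("P3", "S5")): 0.5,
    frozenset(("S3", "S5")): 0.9,  # high similarity, but different clusters
}
candidate_links = links_within_clusters(
    clusters, lambda a, b: sims.get(frozenset((a, b)), 0.0), threshold=0.25)
print(candidate_links)
```

The pair (S3, S5) is never compared even though its similarity is high, which is exactly the kind of link that a cluster-based approach can miss and an exhaustive approach would find.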

3.1 Incremental clustering and link generation 

C²ICM is a seed-based partitioning type clustering scheme.
An advantage of this scheme is that we can predict the
number of clusters using the cover-coefficient-based concept.
This method of predicting the number of clusters agrees with
the hypothesis that the number of clusters within a
document collection should be high if the individual documents
are dissimilar, and low otherwise. In the case of C²ICM,
the order of document addition does not affect the outcome
of the clustering process.
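
The formula for the predicted number of clusters is not given in this paper. As a hedged illustration, the sketch below assumes the decoupling-coefficient formulation usually associated with the cover-coefficient concept [Can93]: each document contributes the extent to which it "covers" only itself, and the contributions are summed. The function name and the toy matrices are hypothetical.

```python
def estimated_cluster_count(matrix):
    """Estimate the number of clusters from a document-by-term weight matrix.

    Assumed formulation: n_c = sum_i alpha_i * sum_k d_ik^2 * beta_k, where
    alpha_i = 1 / (row sum) and beta_k = 1 / (column sum). Weights must be
    non-negative.
    """
    column_sums = [sum(column) for column in zip(*matrix)]
    n_c = 0.0
    for row in matrix:
        row_sum = sum(row)
        if row_sum == 0:
            continue  # an empty document contributes nothing
        alpha = 1.0 / row_sum
        n_c += alpha * sum(d * d / column_sums[k] for k, d in enumerate(row) if d)
    return n_c

identical = [[1, 1], [1, 1]]   # two documents with the same terms
disjoint = [[1, 0], [0, 1]]    # two documents sharing no terms
print(estimated_cluster_count(identical), estimated_cluster_count(disjoint))
```

Under this assumed formulation, identical documents collapse into a single predicted cluster, while documents with no terms in common each get their own.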

C²ICM can be broken down into two different phases:
the cluster seed selection phase and the cluster construction
phase. In the first phase an estimation of the number of
clusters and the document seeds is determined using the
cover coefficient concept. In the second phase the actual
clustering is completed. It is only during the second phase
that link generation is completed. The detailed steps in the
second phase are explained below.

1. The seed documents from phase 1 are sorted by docu- 
ment number. 

2. If this is the first run of the link generation algorithm,
then skip to step 7.

3. For each seed document S in the previous clustering
structure, if S becomes a non-seed document in this
increment, then the cluster containing the document
S is falsified.

4. For each new seed S, if S is an old document and S was
a non-seed document in the previous cluster structure,
then the cluster containing document S is falsified.

5. For each link in the old link set, if the source or the
destination of the link is one of the documents
(document sub-parts) in the falsified clusters, then the link
is deleted.

6. Cluster all documents that belong to the falsified
clusters. If document D is added to cluster C, then new
links are formed between D and members of the
cluster C if they have a similarity value above a certain
threshold.

7. Cluster all the new documents with the seed
documents. For each document D that is being added to a
cluster C, links are formed between D and the
members of C if they have a similarity value above a certain
threshold.
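
Steps 3 to 5 can be sketched as a small function. This is an illustrative reading of the steps, not our implementation; the data structures (a dictionary from seed to non-seed members, and links as pairs) are assumptions. The sample data is taken from the two increments illustrated in Figures 1 and 2.

```python
def falsify_and_prune(old_clusters, new_seeds, old_links):
    """Return the seeds of falsified clusters and the links that survive."""
    old_seeds = set(old_clusters)
    member_to_seed = {member: seed
                      for seed, members in old_clusters.items()
                      for member in members}
    falsified = set()
    # Step 3: an old seed that is no longer a seed falsifies its cluster.
    falsified.update(old_seeds - new_seeds)
    # Step 4: an old non-seed that became a seed falsifies its old cluster.
    for seed in new_seeds - old_seeds:
        if seed in member_to_seed:
            falsified.add(member_to_seed[seed])
    # Step 5: delete links that touch any member of a falsified cluster.
    affected = set()
    for seed in falsified:
        affected.add(seed)
        affected.update(old_clusters[seed])
    kept = [(s, d) for s, d in old_links
            if s not in affected and d not in affected]
    return falsified, kept

old_clusters = {"S1": ["P1", "S3", "S4", "P2"], "P3": ["D1", "S5", "S6"], "S2": []}
new_seeds = {"S1", "S3", "P3", "S9", "P5"}
old_links = [("S1", "S3"), ("S3", "P2"), ("S5", "S6"), ("P3", "S5")]
falsified, kept = falsify_and_prune(old_clusters, new_seeds, old_links)
print(sorted(falsified), kept)
```

Only the members of the falsified clusters, together with the new documents, then need to be reclustered in steps 6 and 7; the rest of the link set is left untouched.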

3.2 Illustration of link generation

Figure 1 and Figure 2 illustrate the link generation process.
The tree-like structure in the figures has the documents
decomposed into paragraphs and sentences. All the documents
and document sub-parts are given identifiers. The D's in the
figures are document identifiers, the P's are paragraph
identifiers, and the S's are sentence identifiers. In this example,
both the document (D) and the document sub-parts (P and
S) are considered for clustering and link generation. As
discussed above, links are generated between pairs in a cluster,
with similarity values above a certain threshold.

Document Structure

D1

S1 S2 S3 S4 S5 S6

Cluster and Link Information

Seeds   Non-Seeds
S1      P1, S3, S4, P2
P3      D1, S5, S6
S2      (none)

Links:

(S1, S3) (S3, P2) (S5, S6) (P3, S5)

Figure 1: Link Generation: First Increment

Figure 1 shows the first increment of the clustering and
link generation process. Document D1 is added to the
collection in this increment. The clustering and the link
generation process results in three clusters and four links. The
number of links is four because only four document pairs
have a similarity value above the specified threshold.

Figure 2 shows the second increment of clustering and
link generation. In this increment, document D2 is added
to the collection. During this increment, the non-seed
document sub-part S3, of the first increment, becomes a new seed
and a seed document sub-part S2, of the first increment,
becomes a non-seed. As a result, the clusters containing both
of these document sub-parts are falsified. This results in the
deletion of links that are associated with these clusters. The
falsified documents (and document sub-parts) and the new
documents (and document sub-parts) are then re-clustered.
The link generation process results in the dynamic addition
of links. Link maintenance and collection browsing are
discussed in the next section.
Document Structure

D1                   D2

S1 S2 S3 S4 S5 S6    S7 S8 S9 S10 S11

Cluster and Link Information

Seeds   Non-Seeds
S1      P1, P2, S4, S10
S3      S7, S8, P4
P3      D1, S5, S6
S9      S10, S11
P5      D2, S2

Falsified Clusters:

Clusters that have S1, S2 as seeds.

Links deleted:

(S1, S3) (S3, P2)

Added Links:

(S1, S10) (S3, S8) (S9, S11)

Resulting Links:

(S1, S10) (S3, S8) (S5, S6) (S9, S11) (P3, S5)

Figure 2: Link Generation: Second Increment




3.3 Link maintenance

Hypertext links are typically represented in one of three
ways: embed all the link information within the document
(as in HTML); embed a persistent link id within the
document but store the link information externally; or store the
link information externally to the document [Dav95]. The
last approach has the advantage that the documents need
not be modified when links are added or deleted. This
approach also supports the navigation of the collection by any
browser, since document formats that are specific to that
browser can be built on the fly. Also, this method supports
different views of the same collection.

In our system, links are stored externally in a relational
database. Our text collection can be navigated using a 
World-Wide Web browser such as Netscape. Traversal of 
a link causes the retrieval of a document from Productivity 
Edge, the creation of an HTML document on the fly, and 
the automatic insertion of links into the document. The 
next section discusses the usefulness of our link generation 
approach. 
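
The idea of storing links externally and inserting them when a page is built can be sketched with an in-memory SQLite database. The table layout, the `/node/...` URL scheme, and the sample rows are hypothetical and do not describe the Productivity Edge schema.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per link, stored apart from the documents.
conn.execute(
    "CREATE TABLE links (source TEXT, destination TEXT, similarity REAL)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?)",
    [("S1", "S10", 0.62), ("S3", "S8", 0.41)])

def render_node(conn, node_id, text):
    """Build an HTML page on the fly and append the links stored for a node."""
    rows = conn.execute(
        "SELECT destination, similarity FROM links "
        "WHERE source = ? ORDER BY similarity DESC", (node_id,)).fetchall()
    anchors = "".join(
        f'<li><a href="/node/{html.escape(dest)}">{html.escape(dest)}</a>'
        f" ({sim:.2f})</li>"
        for dest, sim in rows)
    return (f"<html><body><p>{html.escape(text)}</p>"
            f"<ul>{anchors}</ul></body></html>")

page = render_node(conn, "S1", "Stored text with a < b comparison.")
print(page)
```

Because the stored text is never modified, adding or deleting a row in the link table is enough to change what the next rendered page shows.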



4 Experimental Design and Evaluation 

In this section we present two sets of experiments to study
the usefulness of the cluster-based link generation process.
In the first set of experiments we compare the cluster-based
link generation process with an exhaustive method of link
generation. By exhaustive, we mean taking a document
(document sub-part) in the collection and comparing it to
all other documents (document sub-parts) in the collection.
Links are formed between pairs that have a similarity value
above a certain threshold. Links that are detected by the
exhaustive approach constitute the complete set. This is
because there is no scheme, other than manual, that generates
a more complete set of links than the exhaustive method.

In the second set of experiments we compare the
cluster-based approach to the manual method. The manual method
of generating links is impractical for large collections.
Therefore, this study is performed on a smaller scale than the first
set of experiments. This also provides a way to study the
quality of links generated by the cluster-based approach.

4.1 Study of link completeness using exhaustive approach

The ratio of the number of links generated by the
cluster-based approach to the number of links generated by
the exhaustive approach can be used as a metric for link
completeness. Our hypothesis is that the cluster-based
approach should detect most of the links found by the
exhaustive approach for higher similarity values between the source
and destination of the link. So, we predict that links will
be more complete in the higher similarity range than in the
lower similarity range. This is because the clustering
algorithm groups documents that are more similar to each other
in one cluster and our algorithm generates links only within
the same cluster.
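
The completeness metric can be written down directly. In this sketch the exhaustive link set is a dictionary from unordered pairs to similarity values and the cluster-based link set is a set of pairs; both representations and all sample values are assumptions made for illustration.

```python
def completeness(cluster_links, exhaustive_links, low, high):
    """Percentage of exhaustive links in (low, high] found by clustering."""
    in_range = {pair for pair, sim in exhaustive_links.items()
                if low < sim <= high}
    if not in_range:
        return None  # the ratio is undefined for an empty range
    return 100.0 * len(in_range & cluster_links) / len(in_range)

def pair(a, b):
    return frozenset((a, b))

# Hypothetical link sets: the exhaustive method sees every pair above 0.25.
exhaustive = {pair("A", "B"): 0.9, pair("B", "C"): 0.7,
              pair("A", "C"): 0.3, pair("A", "D"): 0.4}
clustered = {pair("A", "B"), pair("B", "C"), pair("A", "C")}

high_range = completeness(clustered, exhaustive, 0.6, 1.0)
low_range = completeness(clustered, exhaustive, 0.25, 0.6)
print(high_range, low_range)
```

In this toy data the higher range is fully covered while the lower range is only half covered, which is the pattern the hypothesis predicts.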

We ran the experiments on a sample collection of 9 doc- 
uments from the ASCII version of the "Hypertext Com- 
pendium" [ACM91]. The sample documents are listed be- 
low: 

1. htc1.txt: "A Hypertext Model Supporting Query
Mechanisms" by Foto Afrati and Constantinos D. Koutras.

2. htc2.txt: "KMS: A Distributed Hypermedia System
for Managing Knowledge in Organizations" by Robert
M. Akscyn, Donald L. McCracken and Elise A. Yoder.

3. htc10.txt: "Browsing in Hyperdocuments with the
Assistance of a Neural Network" by Frederique Biennier,
Michel Guivarch and Jean-Marie Pinon.

4. htc28.txt: "A Retrieval Model for Incorporating
Hypertext Links" by W. Bruce Croft and Howard Turtle.

5. htc100.txt: "From Ideas and Arguments to
Hyperdocuments: Traveling through Activity Spaces" by
Norbert A. Streitz, Jörg Hannemann, and Manfred Thüring.

6. htc102.txt: "A Visual Representation for Knowledge
Structures" by Michael Travers.

7. htc108.txt: "Links and structures in hypertext databases
for law" by Eve Wilson.

8. htc115.txt: "Hypertext and Information Retrieval: What
are the Fundamental Concepts?" by W. Bruce Croft.

9. htc116.txt: "Hypertext and Higher Education: A
Reality Check" by Stephen C. Ehrmann, Steven Erde,
Kenneth Morrell, and Ronald F. B. Weissman.

We indexed the collection at three different levels:
sentence, paragraph, and document. This resulted in 3633
documents and document sub-parts. We then ran the
cluster-based scheme and the exhaustive scheme to generate links.
The similarity values in our system range from zero to one
and the similarity threshold value used for link generation
was 0.25. We chose this value because we observed that
bad links tend to dominate the link set for lower similarity
values.

4.1.1 Experimental design 

Let y denote the percentage of links determined by the
cluster-based approach when compared to that of the
exhaustive link approach. We want to verify that the selection
of similarity range has a major impact on the value of y.
The other factor that might affect the value of y is the
granularity level of the source and destination nodes of the link.
Since we have two factors to study, we have decided to use
a 2^2 experimental design [Jai91] to study the effects of the
factors on the performance. In this experiment we are
interested only in quantifying the relative contribution of the
factors to the response variable.

The nonlinear regression model for this experiment is:

y = q_0 + q_A x_A + q_B x_B + q_{AB} x_A x_B    (1)

where the x's denote the factors, the q's denote the effects,
y is the response variable, and the subscripts A and B
identify the two factors [Jai91].

The experiments will determine the effects of the factors
and of their interactions on the response variable. We will
also determine the contribution of the factors to the total
variation of y, thereby gauging the importance of the factors.
The total variation of y, denoted by SST, is related to the
squares of the effects by

SST = SSA + SSB + SSAB    (2)

where

SSX = 2^2 q_X^2,  X \in \{A, B, AB\}    (3)

The fraction of variation explained by a factor X is the
ratio SSX/SST.

4.1.2 Factor analysis

The two factors, their corresponding symbols, and their level
assignments are shown in Table 1. Since we use a 2^2 factorial
design, and the similarity values for the links range from
0.25 to 1.00, we divided this range into two sub-ranges and
used them as two different levels for that factor. For the
link granularity factor, one choice was to study all the links.
For the second level, we chose sentential links. Sentential
links are those links where both the source and destination
nodes are sentences. In our approach, terms within shorter
sentences are generally assigned higher weights, due to the
normalized weighting scheme. The effects of this on the
response variable can be studied by choosing the second level
of the granularity factor to be sentential links.

The sign table and the measured responses are shown
in Table 2. In this table E_LINKS denotes the number of
links that are generated by the exhaustive link generation
approach and C_LINKS denotes the number of links that are
determined by the cluster-based approach. The values of the
response variable y are obtained by running the clustering
and the exhaustive schemes for four different combinations
of the levels of factors A and B.

Table 1: Factors and their levels

Symbol  Factor            Level -1     Level 1
A       Similarity Range  (0.25, 0.6]  (0.6, 1.0]
B       Link Granularity  All links    Sentential links

Table 2: Measured Response

I    A    B    AB   E_LINKS  C_LINKS  y
1   -1   -1    1    10399    1268     12.19
1    1   -1   -1    2132     1223     57.36
1   -1    1   -1    4542     563      12.39
1    1    1    1    1208     565      46.77

The effects, computed using Table 2, are shown in Ta- 
ble 3. Table 3 also contains the percentage of variation ex- 
plained, which is computed using (2) and (3). The results 
show that most of the variation of the response is due to the 
contribution of factor A, the similarity range. This factor
contributes 96.6% to the total variation of the response
variable. The granularity level and the interaction of the 
two factors do not have a comparable effect. 
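
The allocation of variation can be recomputed from the signs and responses of Table 2 using equations (1) to (3). The sketch below is an independent check, not the code used for the experiments. It divides each signed sum by the number of runs to obtain per-run effects; the estimates listed in Table 3 appear to be the corresponding signed sums (four times these values, reported as magnitudes), which leaves the fractions of variation unchanged.

```python
# (x_A, x_B, y) for the four runs of Table 2.
runs = [(-1, -1, 12.19), (1, -1, 57.36), (-1, 1, 12.39), (1, 1, 46.77)]

def factorial_effects(runs):
    """Effects and fractions of variation for a 2^2 design (eqs. 1 to 3)."""
    n = len(runs)  # 2^2 = 4 runs
    effects = {
        "0": sum(y for _, _, y in runs) / n,
        "A": sum(a * y for a, _, y in runs) / n,
        "B": sum(b * y for _, b, y in runs) / n,
        "AB": sum(a * b * y for a, b, y in runs) / n,
    }
    # SSX = 2^2 * q_X^2 and SST = SSA + SSB + SSAB.
    ss = {name: n * effects[name] ** 2 for name in ("A", "B", "AB")}
    sst = sum(ss.values())
    fractions = {name: value / sst for name, value in ss.items()}
    return effects, fractions

effects, fractions = factorial_effects(runs)
for name in ("A", "B", "AB"):
    print(name, round(effects[name], 4), f"{100 * fractions[name]:.1f}%")
```

With the rounded responses of Table 2, factor A accounts for roughly 96.6% of the variation, in line with the figure quoted above.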

4.1.3 Interpretation of the results

The results of our experiments show that the similarity range
has a significant impact on the response variable. Also, the
result hints that in the higher similarity range the
cluster-based algorithm determines a higher percentage of links
when compared to the lower similarity range. Table 2
illustrates this fact. Whenever the value of variable A
(similarity range) is -1 (lower similarity values), the percentage
of links that are determined by the clustering approach is
low when compared to the result when the value of A is 1
(higher similarity values). This is true irrespective of the
value of B.

To verify the above result, some informal experiments
were conducted. We recorded the percentage of links that
were determined by the clustering approach for various
similarity ranges. Table 4 shows the results of our
experiment. This result verifies the outcome of the first
experiment. From the table it is clear that as the similarity range
increases, the percentage of links detected by the clustering
scheme increases, except for the similarity range (0.6, 0.7].
The higher number of links in the similarity range (0.9, 1.0]
can be attributed to the short sentences in the documents.
In our study we found that these links are mostly of the
referential type. This includes links between the quotes in
an article and the actual article, and links between the
citations and the reference section of the article. The next
section compares the cluster-based link generation method
to the manual method of link generation.

Table 3: Mean Effects and Allocation of Variation

Effects  Mean Estimate  Variation Explained (%)
q_0      128.72         NA
q_A      79.54          96.61
q_B      10.39          1.6
q_AB     10.79          1.74

Table 4: Similarity Ranges vs Link Percentages

Similarity Range  E_LINKS  C_LINKS  Link Percentage
(0.25, 0.3]       4335     403      9.29
(0.3, 0.4]        4068     434      10.66
(0.4, 0.5]        1226     222      18.10
(0.5, 0.6]        770      209      27.14
(0.6, 0.7]        591      58       9.80
(0.7, 0.8]        92       35       38.04
(0.8, 0.9]        47       29       61.70
(0.9, 1.0]        1402     1101     78.53

Table 5: Link Quality Results

# of manual  # of cluster  Link %  Similarity  # of
links        links                 Threshold   documents
32           107           34      0.35        15
32           135           43      0.30        15
32           181           43      0.25        15
32           56            41      0.35        3
32           72            41      0.30        3
32           89            43      0.25        3

4.2 Study of link completeness using manual approach

The purpose of this study is to compare the quality of links
generated using our clustering approach with links
generated by hand. In the above study of link completeness, our
linking method was compared against an exhaustive link
generation approach for a relatively small collection. The
reasons for selecting an exhaustive approach over a manual
approach for a ground truth become quite apparent when a
collection consists of hundreds of documents. Since a
manual ground truth does not exist for this collection, or any
hypertext collection to our knowledge, we chose to
manually generate links between only 3 documents for this test.
Because of our familiarity with the Hypertext Compendium
we knew that the following three documents were similar in
content:

1. htc16.txt: "As We May Think" by Vannevar Bush.

2. htc70.txt: "Hypertext: Does It Reduce Cholesterol
Too?" by Norm Meyrowitz.

3. htc43.txt: "Information Retrieval From Hypertext:
Updated" by Mark E. Frisse et al.

4.2.1 Experiments and observations 

The main goal of this test is to compare our approach to
the manual approach of link generation and evaluate link
quality.

In order to complete this test, we created a full-text
collection consisting of 15 documents, including the above
documents. We then automatically generated links using the
cluster-based approach at a similarity threshold of .35. For
the 15 documents, 852 clusters were created and 4,348 links
were generated. Our next step was to manually build links
from the article "As We May Think" to the other two
relevant documents in the collection. We chose Vannevar Bush's
article as our base since he is often quoted and cited in the
other two documents. The manually generated links from
the 3 relevant documents were compared with the links from
our clustering approach. Analysis of the results shows that
34% of the manually generated links were found by the
clustering algorithm. The tests were then run at a similarity
level of .30 and .25. In both cases, the number of manual
links found improved to 43%.

In the next test we created a collection consisting of only
the 3 relevant documents and automatically generated links.
This additional test was completed to ensure that variation
of the collection size has no impact on the quantity of links
generated. For the new collection, 398 clusters were created
and 332 links were generated. Comparison of these links
with the above method resulted in an improvement to 41%
at a similarity level of .35, with a slight decrease in
performance to 41% at the .30 level, and performance remaining
the same at the .25 level. Table 5 contains the link
comparison results for these tests using Vannevar Bush's article as
our base (e.g., 32 manual links were created for his article).

4.2.2 Interpretation of results and study of link quality 

Careful evaluation of the links missed by our algorithm
resulted in the discovery of three common problems:
misspellings, document segmentation errors, and misinterpreted
punctuation. Differences in the spelling of words such as
"Encyclopedia of Britannica" and "Encyclopedia of
Brittanica" or "Cabinet Maker" and "Cabinetmaker" between the
documents caused several links to be missed. The parser also
encountered document segmentation problems with short
single lines of text that did not contain proper ending
punctuation (i.e., ".", "!", or "?"). Often these lines of text
were treated as both paragraphs and sentences. The terms
in each line generally carry a high weight, causing links to be
generated. This occurs most often when section headers are
encountered such as "Introduction" or "Section 5". Finally,
problems occurred with the breaking up of citations into
smaller sentences and difficulties with punctuation
embedded in sub-parts such as quotes and ellipses. Where found,
misspellings and punctuation errors were corrected in the
document, and the tests were rerun in order to better
evaluate the link generation software - our results reflect these
changes. As you will note in Table 5, more links were
generated by the automatic method than the manual approach.
These additional links were often links to destinations within
the same document or links resulting from document
segmentation errors. Development of a robust parser will
improve the quality of generated links by reducing the number
of irrelevant links and by not missing relevant links due to
punctuation.

It is our intent to improve the parser and follow up this
test with experiments consisting of several users and a larger
collection. As we have stated, manual generation of links is
very time consuming, especially when a document collection
is large. The identification of a ground truth collection with
relevant links generated between documents at the sentence
and paragraph level would be a great asset to our link quality
tests.

5 Conclusion and Future Directions

A new method for generating automatic hypertext links was
introduced in this paper. This method is both efficient and
dynamic to meet the demands of the ever-growing digital
library. Document clustering was used as the basis of the
automatic link generation process. The link generation process
is embedded within the clustering process. The uniqueness
is that the clustering technique is applied not only to
documents but also to document sub-parts. Links are generated
between document pairs that have a similarity value above a
certain threshold, and the source and the destination nodes
of the links always reside in the same cluster.

Experiments were performed to study link completeness
and link quality. For studying link completeness, we
compared the cluster-based link generation approach to the
exhaustive method of link generation. Results indicate that
the cluster-based link generation approach performs better
for links that have higher similarity values. In particular,
this method works well for referential link types.

In the future, we plan to perform more extensive stud- 
ies on link quality including comprehensive comparisons of 
the automatic and the exhaustive link generation approach. 
Also, we plan to integrate the automatic link generation
system with the Envision digital library of Virginia Tech and
study how this helps faculty and students who use this li- 
brary. 